Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Nov 27 14:13:34 2022"

The text continues here.

I hope everything is now installed correctly. Learning Git/Github seems useful and it is something that should prove to be useful in the future. I am a bit nervous about using R as I’m not very familar with it but we’ll see how this goes.

Thoughts about Exercise set 1

R really seems like a good tool for building graphs and plots and the logic seems pretty straightforward. However, it will take time to learn the syntax and all the commands. It is inevitable that learning new software feels at first like an avalanche of information, but this tutorial for R was nice in that it teaches one thing at a time.

my GitHub repository


Chapter 2) Regression and model validation

Describe the work you have done this week and summarize your learning.

date()
## [1] "Sun Nov 27 14:13:34 2022"

1) Reading data

lrn14 <- read.table("learning2014.csv", sep=",", header=TRUE)

# dimensions of the data
dim(lrn14)
## [1] 166   7
# structure of the data
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

This is survey data from 2014 of Approaches to learning.This is a subset of original learning2014 data, from which 7 variables were picked:gender, age, attitude, deep, stra, surf and points. ‘Deep’(deep learning), ‘stra’ (strategic learning) and ‘surf’ (superficial learning) have been combined from their related items. You can find more information about the dataset from learning2014

2) Overview of data

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
p <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

Looking at the data, there are more females than males. Age is the only variable that is clearly skewed (to the left), while other variables are more evenly distributed. Srongest correlations are found between deep learning and surface learning (-0.32), attitude and point (0.44). Additionally, surface leaning was correlated with strategic learning (-0.16) and attitude (-0.18). All these aforementioned correlations were in same direction in both sexes.

3) and 4) Linear regression model

model_1 <- lm(points ~ attitude + stra + surf, data = lrn14)

summary(model_1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
model_2 <- lm(points ~ attitude, data = lrn14)
summary(model_2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Initially, I chose attitude, strategic learning and surface learning as explanatory variables for points in linear regression model as they were the variables showing highest correlation with points. However, attitude was the only statistically significant predictor of points (Coeff 0.34, p-value <0.001), while the model explained 20% of variation in points (Adjusted R squared). After removing non significant variables from the model, attitude predicted points with Coeff 0.35 (p-value <0.001), while the model explained 19% of variation in points. Taken together, based on this model, attitude was the best predictor of points: better attitude predicted more points.

5) Diagnostic plots

Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

par(mfrow = c(2,2))
plot(model_2, which = c(1, 2, 5))

The main assumptions of Linear regression model are:
* linear relationship between response value and eplanatory value
* errors (residuals) have constant variance across all values of eplanatory variable
* errors are independent of each other
* errors have a normal distribution

Residuals vs fitted should not show a pattern where distribution of residuals varies along fitted values. In the figure above residuals are nicely evenly distributed.
QQ-plot should show a line if residuals are normally distributed. Figure above shows that the data is pretty much following the line excluding few outliers.
Residuals vs leverage should not show points outside Cook’s distance, which holds for the figure above.
Taken together, the assumptions of linear regression model hold true for the current model.


Chapter 3) Logistic regrsession

date()
## [1] "Sun Nov 27 14:13:45 2022"

Reading data

The data consists of student performance data in two Portuguese schools, collected by reports and questionnaires. It consists of two datasets from two different subjects: math and Portuguese. These datasets were combined and the variables of the combined dataset are printed below. In addition, two variables were calculated: alc_use is the average alcohol use from weekdays and weekends and high_use tells is true when alc_use is over 2

More information (including variable information) and the original dataset can be found from UCI_machine_learning_repository

alc <- read.csv("C:/Users/labpaavo/IODS-project/data/alc.csv")

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Variables of interest in relation to alcohol consumption and hypotheses

For the variables of interest in relation to high/low alcohol consumption I have chosen: sex, Pstatus, studytime and absences. The hypotheses are that:
- being a male predicts high alcohol consumption
- parents living apart predicts high alcohol consumption
- low studytime predicts high alcohol consumption
- high absences predicts high alcohol consumption

Exploring the chosen variables

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)

library(descr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

my_variables = c("high_use", "sex", "Pstatus", "studytime", "absences")
my_data <- alc[my_variables]

summary(my_data)
##   high_use           sex              Pstatus            studytime    
##  Mode :logical   Length:370         Length:370         Min.   :1.000  
##  FALSE:259       Class :character   Class :character   1st Qu.:1.000  
##  TRUE :111       Mode  :character   Mode  :character   Median :2.000  
##                                                        Mean   :2.043  
##                                                        3rd Qu.:2.000  
##                                                        Max.   :4.000  
##     absences     
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 3.000  
##  Mean   : 4.511  
##  3rd Qu.: 6.000  
##  Max.   :45.000
CrossTable(my_data$high_use, my_data$sex)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =========================================
##                     my_data$sex
## my_data$high_use        F       M   Total
## -----------------------------------------
## FALSE                 154     105     259
##                     2.244   2.500        
##                     0.595   0.405   0.700
##                      0.79    0.60        
##                     0.416   0.284        
## -----------------------------------------
## TRUE                   41      70     111
##                     5.235   5.833        
##                     0.369   0.631   0.300
##                      0.21    0.40        
##                     0.111   0.189        
## -----------------------------------------
## Total                 195     175     370
##                     0.527   0.473        
## =========================================
CrossTable(my_data$high_use, my_data$Pstatus)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =========================================
##                     my_data$Pstatus
## my_data$high_use        A       T   Total
## -----------------------------------------
## FALSE                  26     233     259
##                     0.014   0.002        
##                     0.100   0.900   0.700
##                     0.684   0.702        
##                     0.070   0.630        
## -----------------------------------------
## TRUE                   12      99     111
##                     0.032   0.004        
##                     0.108   0.892   0.300
##                     0.316   0.298        
##                     0.032   0.268        
## -----------------------------------------
## Total                  38     332     370
##                     0.103   0.897        
## =========================================
CrossTable(my_data$high_use, my_data$studytime)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =========================================================
##                     my_data$studytime
## my_data$high_use        1       2       3       4   Total
## ---------------------------------------------------------
## FALSE                  56     128      52      23     259
##                     2.314   0.017   2.381   0.889        
##                     0.216   0.494   0.201   0.089   0.700
##                     0.571   0.692   0.867   0.852        
##                     0.151   0.346   0.141   0.062        
## ---------------------------------------------------------
## TRUE                   42      57       8       4     111
##                     5.400   0.041   5.556   2.075        
##                     0.378   0.514   0.072   0.036   0.300
##                     0.429   0.308   0.133   0.148        
##                     0.114   0.154   0.022   0.011        
## ---------------------------------------------------------
## Total                  98     185      60      27     370
##                     0.265   0.500   0.162   0.073        
## =========================================================
my_data %>% group_by(high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 2 × 3
##   high_use count mean_absences
##   <lgl>    <int>         <dbl>
## 1 FALSE      259          3.71
## 2 TRUE       111          6.38
g1 <-  ggplot(data = my_data, aes(x = sex))
g1 + geom_bar() + facet_wrap("high_use") 

g2 <-  ggplot(data = my_data, aes(x = Pstatus))
g2 + geom_bar() + facet_wrap("high_use") 

g3 <-  ggplot(data = my_data, aes(x = studytime))
g3 + geom_bar() + facet_wrap("high_use") 

g4 <-  ggplot(data = my_data, aes(x = absences))
g4 + geom_bar() + facet_wrap("high_use") 

Looking at data, it seems that first of all there are more students with low alcohol consumption (259) compared to high- (111). Absences is clearly skewed to left (less absences). It seems that high consumption group has more males (63%) compared to low consumption (41%). Parental status seems to be similarly distributed between the grpoups. In regards to studytime, high consumption group has notably higher frequency of students who study very little (37.8% vs 21.6%). Finally, high consumption group has higher mean absences (6.4), compared to low consumption group (3.7). Taken together, it seems that sex, studytime and absences could maybe predict alcohol consumptions, whereas parental status seems to be pretty evenly distributed between groups and thus will unlikely predict alcohol consumption.

Logistic regression

m <- glm(high_use ~ sex + Pstatus + studytime + absences, data = my_data, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + Pstatus + studytime + absences, 
##     family = "binomial", data = my_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1339  -0.8389  -0.5966   1.0728   2.1221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.01835    0.55916  -1.821 0.068574 .  
## sexM         0.84663    0.25509   3.319 0.000903 ***
## PstatusT     0.12379    0.40420   0.306 0.759415    
## studytime   -0.41526    0.16023  -2.592 0.009550 ** 
## absences     0.08928    0.02345   3.807 0.000141 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 408.40  on 365  degrees of freedom
## AIC: 418.4
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.3611909 0.1172496 1.0606103
## sexM        2.3317700 1.4202152 3.8687722
## PstatusT    1.1317735 0.5251790 2.5945605
## studytime   0.6601657 0.4776116 0.8967295
## absences    1.0933826 1.0465506 1.1475274

Logistic regression models showed that sex, studytime and absences were significant predictors high/low alcohol consumptions:
sexM log odds .85 (p-value <0.001) studytime log odds -.42 (p-value < 0.01), meaning that for unit increase in studytime the log odds for being n high alcohol consumption decrease by 0.42.
absences log odds .09 (p-value <0.001), meaning that for each absence the log odds for high alcohol consumption changes by 0.09.

The odds for a male being in high alcohol consumption groups over a female is 2.3 (CI 1.4; 3.9). For unit increase in studytime the odds of being in high alcohol consumption group decrese by 34% (CI 0.48; 0.90), note that studytime had four possible values corresponging to <2 hours, 2 to 5 hours, 5 to 10 hours and >10 hours. For each absence the odds of being in high alcohol consumption group increase by 9.3% (CI 4.7%; 14.8%).

Against the initial hypothesis parental status did not predict alcohol consumption, however this was could be predicted already by exploring the data. The effect of sex, studytime and absences were as hypothesized as being a male, lower studytime and higher absences predicted being in high alcohol consumption group.

Predictive power of the model

# drop parental status from the model as it wasn't statistically significant predictor
new_model <- glm(high_use ~ sex + studytime + absences, data = my_data, family = "binomial")

probabilities <- predict(new_model, type = "response")
my_data <- mutate(my_data, probability = probabilities)
my_data <- mutate(my_data, prediction = (probability > 0.5))

table(high_use = my_data$high_use, prediction = my_data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   250    9
##    TRUE     86   25
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# training error 
loss_func(class = my_data$high_use, prob = my_data$probability)
## [1] 0.2567568
g <- ggplot(my_data, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

The model resulted in 250 true negatives, 25 true positives, 86 false negatives and 9 false positives.
The training error of the model is 25.7%.

10-fold cross-validation

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(class = my_data$high_use, prob = my_data$probability)
## [1] 0.2567568
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2648649

The test set error of my model with 10-fold cross-validation is ~0.27, which is slightly worse than the model introduced in Exercise set.

Chapter 4) Clustering

date()
## [1] "Sun Nov 27 14:13:48 2022"

Reading data and exploring it

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.92 loaded
library(tidyr)
library(ggplot2)
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Boston data consists of 14 variables (and 506 observations) and it is about housing values in suburbs of Boston. Details and full descriptions of the variables can be found from Boston.

Graphical overview of the data

#summaries of the variable data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

#library(Hmisc)
boston_df <- as.data.frame(Boston)
#hist.data.frame(boston_df) this plot gives an error while knitting the index file so unfortunately can't display it


cor_matrix <- cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The figure of pairs function was so difficult to read on my laptop, so I ended up drawing also histograms to look at the distributions of the variables (unfortunately it can’t be displayed while knitting index because of the plot size, code commented out). Many variables are heavily skewed. The only variables close to normal distribution seem to be ‘average number of rooms per dwelling’(rm) and ‘median value of owner-occupied homes in $1000s’ (medv).

The correlation plot shows the relationships between the variables. Strongest negative correlations can be found between:
* weighted mean of distances to five Boston employment centres (dis) and proportion of owner-occupied units built prior to 1940 (age)
* dis and nitrogen oxides concentration (nox)
* dis and proportion of non-retail business acres per town (indus)
* lower status of the population percent (lastat) and median value of owner-occupied homes in $1000s (medv)

Strongest positive correlation is between index of accessibility to radial highways (rad) and full-value property-tax rate per $10,000 (tax)

Standardize the dataset

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)


ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))

Linear discriminant analysis

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 1)

Predict classes with LDA model

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class) 
##           predicted
## correct    low med_low med_high high
##   low       15       9        1    0
##   med_low    4      18        4    0
##   med_high   0       8       17    0
##   high       0       0        1   25

LDA model seems to predict classes quite well with accuracy of ~75% (varying between seeds). The model performed best in classifying high crime rates.

k-means clustering

# reloading the dataset
library(MASS)
data("Boston")

# standardizing the dataset
boston_scaled2 <- as.data.frame(scale(Boston))

# distances between observations
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Looking for optimal number of clusters

library(ggplot2)
set.seed(13)

k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal number of clusters is when the value of total WCSS (y-axis) changes radically, here it seems to be around two clusters.

# k-means clustering
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2[1:6], col = km$cluster) 

pairs(boston_scaled2[7:14], col = km$cluster) 

I plotted the clusters in two separate plots, since the figures are otherwise two small for me to see in laptop, but this way I don’t see all the pairs. Overall it looks like two clusters works nicely in most of the pairs.

Super-Bonus:

matrix product

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# color is the crime classes of exercise set: classes is determined earlier as classes <- as.numeric(train$crime) 
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes)